智能论文笔记

Diagnosing BERT with Retrieval Heuristics

Arthur Câmara , Claudia Hauff

分类：自然语言处理

2022-01-12

Word Embeddings于2013年在2013年宣传了Word2Vec，已成为NLP工程管道的主流。最近，随着BERT的发布，Word Embeddings已经从基于术语的嵌入空间移动到上下文嵌入空间 - 每个术语不再由单个低维向量表示，而是每个术语，而是\ \ EMPH {其上下文}。确定矢量权重。 BERT的设置和架构已被证明足以适用于许多自然语言任务。重要的是，对于信息检索（IR），与IR问题的先前深度学习解决方案相比，需要在神经净架构和培训制度的显着调整，“Vanilla BERT”已被证明以广泛的余量优于现有的检索算法，包括任务在传统的IR基线（如Robust04）上有很长的抵抗检索有效性的Corpora。在本文中，我们采用了最近提出的公理数据集分析技术 - 即，我们创建了每个诊断数据集，每个诊断数据集都满足检索启发式（术语匹配和语义） - 探索BERT能够学习的是什么。与我们的期望相比，我们发现BERT，当应用于最近发布的具有ad-hoc主题的大规模Web语料库时，\ emph {否}遵守任何探索的公理。与此同时，BERT优于传统的查询似然检索模型40 \％。这意味着IR的公理方法（及其为检索启发式创建的诊断数据集的扩展）可能无法适用于大型语料库。额外的 - 需要不同的公理。

translated by 谷歌翻译

Constructing Organism Networks from Collaborative Self-Replicators

Steffen Illium , Maximilian Zorn , Cristian Lenta , Michael Kölle , Claudia Linnhoff-Popien , Thomas Gabor

分类：神经与进化计算 | 机器学习

2022-12-20

We introduce organism networks, which function like a single neural network but are composed of several neural particle networks; while each particle network fulfils the role of a single weight application within the organism network, it is also trained to self-replicate its own weights. As organism networks feature vastly more parameters than simpler architectures, we perform our initial experiments on an arithmetic task as well as on simplified MNIST-dataset classification as a collective. We observe that individual particle networks tend to specialise in either of the tasks and that the ones fully specialised in the secondary task may be dropped from the network without hindering the computational accuracy of the primary task. This leads to the discovery of a novel pruning-strategy for sparse neural networks

translated by 谷歌翻译

Empirical Analysis of Limits for Memory Distance in Recurrent Neural Networks

Steffen Illium , Thore Schillman , Robert Müller , Thomas Gabor , Claudia Linnhoff-Popien

分类：机器学习 | 计算机视觉

2022-12-20

Common to all different kinds of recurrent neural networks (RNNs) is the intention to model relations between data points through time. When there is no immediate relationship between subsequent data points (like when the data points are generated at random, e.g.), we show that RNNs are still able to remember a few data points back into the sequence by memorizing them by heart using standard backpropagation. However, we also show that for classical RNNs, LSTM and GRU networks the distance of data points between recurrent calls that can be reproduced this way is highly limited (compared to even a loose connection between data points) and subject to various constraints imposed by the type and size of the RNN in question. This implies the existence of a hard limit (way below the information-theoretic one) for the distance between related data points within which RNNs are still able to recognize said relation.

translated by 谷歌翻译

VoronoiPatches: Evaluating A New Data Augmentation Method

Steffen Illium , Gretchen Griffin , Michael Kölle , Maximilian Zorn , Jonas Nüßlein , Claudia Linnhoff-Popien

分类：计算机视觉 | 机器学习

2022-12-20

Overfitting is a problem in Convolutional Neural Networks (CNN) that causes poor generalization of models on unseen data. To remediate this problem, many new and diverse data augmentation methods (DA) have been proposed to supplement or generate more training data, and thereby increase its quality. In this work, we propose a new data augmentation algorithm: VoronoiPatches (VP). We primarily utilize non-linear recombination of information within an image, fragmenting and occluding small information patches. Unlike other DA methods, VP uses small convex polygon-shaped patches in a random layout to transport information around within an image. Sudden transitions created between patches and the original image can, optionally, be smoothed. In our experiments, VP outperformed current DA methods regarding model variance and overfitting tendencies. We demonstrate data augmentation utilizing non-linear re-combination of information within images, and non-orthogonal shapes and structures improves CNN model robustness on unseen data.

translated by 谷歌翻译

AWT -- Clustering Meteorological Time Series Using an Aggregated Wavelet Tree

Christina Pacher , Irene Schicker , Rosmarie deWit , Katerina Hlavackova-Schindler , Claudia Plant

分类：机器学习

2022-12-13

Both clustering and outlier detection play an important role for meteorological measurements. We present the AWT algorithm, a clustering algorithm for time series data that also performs implicit outlier detection during the clustering. AWT integrates ideas of several well-known K-Means clustering algorithms. It chooses the number of clusters automatically based on a user-defined threshold parameter, and it can be used for heterogeneous meteorological input data as well as for data sets that exceed the available memory size. We apply AWT to crowd sourced 2-m temperature data with an hourly resolution from the city of Vienna to detect outliers and to investigate if the final clusters show general similarities and similarities with urban land-use characteristics. It is shown that both the outlier detection and the implicit mapping to land-use characteristic is possible with AWT which opens new possible fields of application, specifically in the rapidly evolving field of urban climate and urban weather.

translated by 谷歌翻译

Multi-view Graph Convolutional Networks with Differentiable Node Selection

Zhaoliang Chen , Lele Fu , Shunxin Xiao , Shiping Wang , Claudia Plant , Wenzhong Guo

分类：机器学习

2022-12-09

Multi-view data containing complementary and consensus information can facilitate representation learning by exploiting the intact integration of multi-view features. Because most objects in real world often have underlying connections, organizing multi-view data as heterogeneous graphs is beneficial to extracting latent information among different objects. Due to the powerful capability to gather information of neighborhood nodes, in this paper, we apply Graph Convolutional Network (GCN) to cope with heterogeneous-graph data originating from multi-view data, which is still under-explored in the field of GCN. In order to improve the quality of network topology and alleviate the interference of noises yielded by graph fusion, some methods undertake sorting operations before the graph convolution procedure. These GCN-based methods generally sort and select the most confident neighborhood nodes for each vertex, such as picking the top-k nodes according to pre-defined confidence values. Nonetheless, this is problematic due to the non-differentiable sorting operators and inflexible graph embedding learning, which may result in blocked gradient computations and undesired performance. To cope with these issues, we propose a joint framework dubbed Multi-view Graph Convolutional Network with Differentiable Node Selection (MGCN-DNS), which is constituted of an adaptive graph fusion layer, a graph learning module and a differentiable node selection schema. MGCN-DNS accepts multi-channel graph-structural data as inputs and aims to learn more robust graph fusion through a differentiable neural network. The effectiveness of the proposed method is verified by rigorous comparisons with considerable state-of-the-art approaches in terms of multi-view semi-supervised classification tasks.

translated by 谷歌翻译

MedalCare-XL: 16,900 healthy and pathological 12 lead ECGs obtained through electrophysiological simulations

Karli Gillette , Matthias A. F. Gsell , Claudia Nagel , Jule Bender , Bejamin Winkler , Steven E. Williams , Markus Bär , Tobias Schäffter , Olaf Dössel , Gernot Plank

分类：机器学习

2022-11-29

Mechanistic cardiac electrophysiology models allow for personalized simulations of the electrical activity in the heart and the ensuing electrocardiogram (ECG) on the body surface. As such, synthetic signals possess known ground truth labels of the underlying disease and can be employed for validation of machine learning ECG analysis tools in addition to clinical signals. Recently, synthetic ECGs were used to enrich sparse clinical data or even replace them completely during training leading to improved performance on real-world clinical test data. We thus generated a novel synthetic database comprising a total of 16,900 12 lead ECGs based on electrophysiological simulations equally distributed into healthy control and 7 pathology classes. The pathological case of myocardial infraction had 6 sub-classes. A comparison of extracted features between the virtual cohort and a publicly available clinical ECG database demonstrated that the synthetic signals represent clinical ECGs for healthy and pathological subpopulations with high fidelity. The ECG database is split into training, validation, and test folds for development and objective assessment of novel machine learning algorithms.

translated by 谷歌翻译

A New Graph Node Classification Benchmark: Learning Structure from Histology Cell Graphs

Claudia Vanea , Jonathan Campbell , Omri Dodi , Liis Salumäe , Karen Meir , Drorith Hochner-Celnikier , Hagit Hochner , Triin Laisk , Linda M. Ernst , Cecilia M. Lindgren

分类：机器学习 | 计算机视觉

2022-11-11

We introduce a new benchmark dataset, Placenta, for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in placenta histology whole slide images. This problem is uniquely challenging for graph learning for a few reasons. Cell graphs are large (>1 million nodes per image), node features are varied (64-dimensions of 11 types of cells), class labels are imbalanced (9 classes ranging from 0.21% of the data to 40.0%), and cellular communities cluster into heterogeneously distributed tissues of widely varying sizes (from 11 nodes to 44,671 nodes for a single structure). Here, we release a dataset consisting of two cell graphs from two placenta histology images totalling 2,395,747 nodes, 799,745 of which have ground truth labels. We present inductive benchmark results for 7 scalable models and show how the unique qualities of cell graphs can help drive the development of novel graph neural network architectures.

translated by 谷歌翻译

ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

Juan Zuluaga-Gomez , Karel Veselý , Igor Szöke , Petr Motlicek , Martin Kocour , Mickael Rigault , Khalid Choukri , Amrutha Prasad , Seyyed Saeed Sarfjoo , Iuliia Nigmatulina

分类：自然语言处理 | 人工智能

2022-11-08

Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried between an air traffic controller (ATCO) and pilots via very-high frequency radio channels. In order to incorporate these novel technologies into ATC (low-resource domain), large-scale annotated datasets are required to develop the data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, signal-to-noise ratio estimate and English language detection score per sample. Both available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset from the original test set corpus, that we are offering for free at https://www.atco2.org/data. We expect the ATCO2 corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.

translated by 谷歌翻译

Retrospectives on the Embodied AI Workshop

Matt Deitke , Dhruv Batra , Yonatan Bisk , Tommaso Campari , Angel X. Chang , Devendra Singh Chaplot , Changan Chen , Claudia Pérez D'Arpino , Kiana Ehsani , Ali Farhadi

分类：计算机视觉

2022-10-13

We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.

translated by 谷歌翻译